Dataframes

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • Warm up
  • What is a dataframe?
  • Observations
  • Variables
  • Recoding variables
  • Scope of data

Warm up

R learning curve

What is a dataframe?

Table

It's a generic name. It can be almost anything.

  • Periodic table
  • Multiplication table
  • Truth table
  • Chi-squared table
  • Phonetic table

Data(s)

  • Source of information (SI): Raw empirical material.
  • Data (s/p): Collected, processed, systematized and organized SI (Van Evera 2009).
    • Numbers, characters, symbols… with no meaning on their own.
  • Database: An organized collection of data stored and accessed electronically / An organized collection of data stored as multiple datasets.
  • Dataset: A structured collection of data generally associated with a unique body of work.

Spreadsheet

How Excel stores data in two dimensions:

Dataframe

A way to store data in R in two dimensions, rows and columns:

# A tibble: 17,548 × 9
   scode country      year polity2 xrreg xrcomp xropen xconst parreg
   <chr> <chr>       <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 AFG   Afghanistan  1800      -6     3      1      1      1      3
 2 AFG   Afghanistan  1801      -6     3      1      1      1      3
 3 AFG   Afghanistan  1802      -6     3      1      1      1      3
 4 AFG   Afghanistan  1803      -6     3      1      1      1      3
 5 AFG   Afghanistan  1804      -6     3      1      1      1      3
 6 AFG   Afghanistan  1805      -6     3      1      1      1      3
 7 AFG   Afghanistan  1806      -6     3      1      1      1      3
 8 AFG   Afghanistan  1807      -6     3      1      1      1      3
 9 AFG   Afghanistan  1808      -6     3      1      1      1      3
10 AFG   Afghanistan  1809      -6     3      1      1      1      3
# … with 17,538 more rows
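A minimal sketch of how such a dataframe can be built by hand. It reuses the first three rows and four of the columns shown above; tibble() is available through dplyr, and the object name polity_small is only illustrative.

```r
library(dplyr)  # dplyr re-exports tibble()

# First three rows of the output above, keeping four columns
polity_small <- tibble(
  scode   = c("AFG", "AFG", "AFG"),
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1800, 1801, 1802),
  polity2 = c(-6, -6, -6)
)

nrow(polity_small)   # number of rows
ncol(polity_small)   # number of columns
names(polity_small)  # column names
```

Each vector passed to tibble() becomes one column, and all columns must have the same length (or length 1).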

Tidy data

We consider that a dataframe is tidy if it fulfills the following requirements (Wickham 2014):

  • Each dataframe has one unit of observation.
  • Observations are represented in the rows.
  • Variables are represented in the columns.
  • Each cell indicates a value.
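As an illustration of these requirements, the sketch below reshapes a small "wide" table, where years are spread across columns, into a tidy one. The values are made up, and the example assumes the tidyr package, which is not among the packages loaded in these slides.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr is installed

# Made-up wide table: one column per year, so 'year' is hidden in the column names
wide <- tibble(
  country = c("A", "B"),
  `2004`  = c(1, 3),
  `2005`  = c(2, 4)
)

# One row per country-year, one column per variable
long <- wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "value") %>%
  mutate(year = as.integer(year))

long
```

After reshaping, the unit of observation is the country-year, and year and value are ordinary columns.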

RStudio workflow

Load packages.

library(dplyr)
library(readr)
library(stringr)
library(forcats)
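Once the packages are loaded, data is usually imported with readr. The sketch below is only an assumption about how that step might look: instead of a real file path, it reads literal CSV text wrapped in I(), which requires readr 2.0 or later. The two rows are taken from the Polity output shown earlier.

```r
library(readr)

# Literal CSV text stands in for a file path such as "my_data.csv" (hypothetical)
polity <- read_csv(
  I("scode,year,polity2\nAFG,1800,-6\nAFG,1801,-6\n"),
  show_col_types = FALSE  # silence the column-type message
)

polity
```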

Observations

Observing …

We need to decide what our units of interest are.

What is an observation?

  • Unit of analysis: The thing that we want to know about.
    • Determined by the hypothesis / question.
  • Unit of observation: Each row of a dataframe.
    • Determined by the data.

Ethnic Power Relations, International Conflict Research.

# A tibble: 14 × 5
   countryname  year groupname statusname     groupsize
   <chr>       <dbl> <chr>     <chr>              <dbl>
 1 Belgium      1967 Flemings  JUNIOR PARTNER     0.59 
 2 Belgium      1967 Walloon   SENIOR PARTNER     0.4  
 3 Belgium      1967 Germans   IRRELEVANT         0.01 
 4 France       1967 French    MONOPOLY           0.976
 5 France       1967 Basques   POWERLESS          0.013
 6 France       1967 Corsicans POWERLESS          0.004
 7 France       1967 Roma      DISCRIMINATED      0.006
 8 Belgium      1968 Flemings  JUNIOR PARTNER     0.59 
 9 Belgium      1968 Walloon   SENIOR PARTNER     0.4  
10 Belgium      1968 Germans   IRRELEVANT         0.01 
11 France       1968 French    MONOPOLY           0.976
12 France       1968 Basques   POWERLESS          0.013
13 France       1968 Corsicans POWERLESS          0.004
14 France       1968 Roma      DISCRIMINATED      0.006
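In the table above the unit of observation is the ethnic group in a country-year. If the unit of analysis were the country-year, the rows would need to be aggregated. A minimal sketch with the 1967 rows; the summary columns n_groups and largest_group are illustrative choices, not variables from the original dataset.

```r
library(dplyr)

# The 1967 rows from the table above
epr <- tibble(
  countryname = c(rep("Belgium", 3), rep("France", 4)),
  year        = 1967,
  groupname   = c("Flemings", "Walloon", "Germans",
                  "French", "Basques", "Corsicans", "Roma"),
  groupsize   = c(0.59, 0.4, 0.01, 0.976, 0.013, 0.004, 0.006)
)

# Collapse group-level rows into one row per country-year
country_year <- epr %>%
  group_by(countryname, year) %>%
  summarise(
    n_groups      = n(),             # number of groups listed
    largest_group = max(groupsize),  # relative size of the largest group
    .groups = "drop"
  )

country_year
```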

Levels of analysis

  • Macro level: States, regions, legal systems.
  • Meso level: Organizations, ethnic groups, political parties.
  • Micro level: Families, individuals, relationships.
    • Events: Bombings, contracts, terrorist attacks.

# A tibble: 477 × 8
   cowcode region  year country    no  coup successful combat
     <dbl>  <dbl> <dbl> <chr>   <dbl> <dbl>      <dbl>  <dbl>
 1      40      5  1952 Cuba        1     1          1      1
 2      40      5  1957 Cuba        1     1          0      1
 3      41      5  1950 Haiti       1     1          1      0
 4      41      5  1956 Haiti       1     1          0      0
 5      41      5  1957 Haiti       1     1          1      0
 6      41      5  1957 Haiti       2     1          1      0
 7      41      5  1957 Haiti       3     1          1      0
 8      41      5  1958 Haiti       1     1          0      1
 9      41      5  1970 Haiti       1     1          0      0
10      41      5  1986 Haiti       1     1          1      0
# … with 467 more rows

Coup Agency and Mechanisms Dataset
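Here each row is an event (a coup), so a country can appear several times in the same year. A short sketch, using three columns of the ten rows shown above, of how events can be counted per country-year with count(); the column name n_events is an illustrative choice.

```r
library(dplyr)

# The ten rows displayed above (three columns only)
coups <- tibble(
  country    = c("Cuba", "Cuba", rep("Haiti", 8)),
  year       = c(1952, 1957, 1950, 1956, 1957, 1957, 1957, 1958, 1970, 1986),
  successful = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1)
)

# Number of coup events per country-year
events_per_year <- coups %>%
  count(country, year, name = "n_events")

events_per_year
```

In these ten rows, Haiti in 1957 is the only country-year with more than one event.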

Ecological fallacy

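When the unit of analysis (UA) and the unit of observation (UO) are not the same, we run the risk of committing an ecological fallacy: inferring something about lower-level units (e.g. individuals) from data observed only at an aggregate level.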

Ecological fallacy

Barcelona local elections: District level.

Ecological fallacy

Barcelona local elections: Neighbourhood level.

Ecological fallacy

Barcelona local elections: Census section level.
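The problem can be illustrated with a toy example. The numbers below are made up for illustration only and have nothing to do with the Barcelona data: within each district x and y move in opposite directions, while the district averages move together.

```r
library(dplyr)

# Made-up individual-level data in three hypothetical districts
people <- tibble(
  district = rep(c("A", "B", "C"), each = 3),
  x        = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  y        = c(3, 2, 1, 6, 5, 4, 9, 8, 7)
)

# District-level averages and the correlation inside each district
districts <- people %>%
  group_by(district) %>%
  summarise(
    mean_x     = mean(x),
    mean_y     = mean(y),
    within_cor = cor(x, y),
    .groups = "drop"
  )

# Correlation between the district averages
between_cor <- cor(districts$mean_x, districts$mean_y)

districts
between_cor
```

The correlation between district averages is positive, yet inside every district it is negative. Someone who only sees the district-level data would draw the wrong conclusion about individuals.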

Variables

What is a variable?

A characteristic of the object we’re studying.

  • It varies across units.

# A tibble: 6 × 5
  region municipality            religion   population suicide
  <chr>  <chr>                   <chr>           <dbl>   <dbl>
1 Isère  Grenoble                Protestant       8250     520
2 Isère  Grenoble                Catholic         1080      72
3 Isère  Le Bourg-d'Oisans       Protestant        325      12
4 Isère  Le Bourg-d'Oisans       Catholic          593      20
5 Isère  Saint-Jean-de-Maurienne Protestant        181       5
6 Isère  Saint-Jean-de-Maurienne Catholic          392      11
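One way to check whether a column actually varies across units is to count its distinct values. A minimal sketch with the table above; the object names are illustrative.

```r
library(dplyr)

# The six rows shown above
suicides <- tibble(
  region       = "Isère",
  municipality = rep(c("Grenoble", "Le Bourg-d'Oisans",
                       "Saint-Jean-de-Maurienne"), each = 2),
  religion     = rep(c("Protestant", "Catholic"), times = 3),
  population   = c(8250, 1080, 325, 593, 181, 392),
  suicide      = c(520, 72, 12, 20, 5, 11)
)

# Number of distinct values in each column
variation <- suicides %>%
  summarise(across(everything(), n_distinct))

variation
```

In these six rows region takes a single value, so within this sample it behaves as a constant rather than as a variable.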

Types of variables (I): Nominal

Unordered categories:

  • Municipality: Barcelona, Sant Cugat, Granollers…
  • Religion: Muslim, Catholic, Shinto…
  • Language: Russian, Catalan, Swedish.
  • Ideology: Conservatism, Nationalism, Liberalism…
  • Political parties: PSOE, PP, Cs, ERC…

To work with strings, use stringr (Wickham 2022).
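A short sketch of typical stringr operations on a nominal variable. The messy spellings of the party names are invented to show the cleaning step.

```r
library(stringr)

# Party names with inconsistent case and stray spaces (invented for illustration)
parties <- c("psoe", " PP", "Cs ", "ERC")

# Remove surrounding whitespace and harmonize case
clean <- str_to_upper(str_trim(parties))
clean                           # "PSOE" "PP" "CS" "ERC"

# Which names start with "P"?
starts_p <- str_detect(clean, "^P")
starts_p                        # TRUE TRUE FALSE FALSE

# Number of characters in each name
str_length(clean)               # 4 2 2 3
```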

Types of variables (II): Ordinal

Ordered categories:

  • Things: Small, Medium, Large.
  • Age: Child, Young, Adult.
  • Education: Primary, Secondary, Tertiary.
  • Ideas: Disagree, Neutral, Agree.

To work with factors, use forcats (Wickham 2021).
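By default R sorts factor levels alphabetically, which is rarely the substantive order of an ordinal variable. A minimal sketch of how the order might be set with forcats, using the Small / Medium / Large example from the list above.

```r
library(forcats)

size <- factor(c("Medium", "Small", "Large", "Small"))
levels(size)                    # "Large" "Medium" "Small" (alphabetical)

# Put the levels in their substantive order
size <- fct_relevel(size, "Small", "Medium", "Large")
levels(size)                    # "Small" "Medium" "Large"

# An ordered factor also allows comparisons such as < and >=
size_ord <- as.ordered(size)
smaller_than_large <- size_ord < "Large"
smaller_than_large              # TRUE TRUE FALSE TRUE
```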

Types of variables (III): Interval

Numbers, zero is arbitrary.

  • Year: 2004, 2005, 2008, 2010.
  • Temperature (except Kelvin): 10, 25, 30.
  • Ideology: Left-right measured as 0-10.
  • Coordinates: Longitude and latitude.

Data available: Polity V.

Types of variables (IV): Ratio

Numbers, zero has meaning.

  • Age: 2, 5, 7, 9.
  • Percentages: 0%, 34%, 100%.
  • Population: 200, 3345000, 13000000.
  • Indices (not all of them): 0.245, 0.999.

Data available: National Material Capabilities (NMC) dataset (Singer 1987; Singer and Small 1972).

Types of variables (V): Summary

Type                 Characteristics            Vector               Operations
Nominal categorical  Unordered categories       Character or factor  ==, !=
Ordinal categorical  Ordered categories         Factor               ==, !=, <=, <, >, >=
Interval numeric     Numbers, arbitrary zero    Numeric or integer   ==, !=, <=, <, >, >=, +, -
Ratio numeric        Numbers, meaningful zero   Numeric              ==, !=, <=, <, >, >=, +, -, *, / …

Recoding variables

Boolean operators

  • AND (&): TRUE if all conditions are met.
  • OR (|): TRUE if any condition is met.
  • NOT (!): TRUE if the condition is not met.
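A sketch of the three operators inside dplyr's filter(), using the first five rows of the coup table shown earlier.

```r
library(dplyr)

# First five rows of the coup table (four columns only)
coups <- tibble(
  country    = c("Cuba", "Cuba", "Haiti", "Haiti", "Haiti"),
  year       = c(1952, 1957, 1950, 1956, 1957),
  successful = c(1, 0, 1, 0, 1),
  combat     = c(1, 1, 0, 0, 0)
)

# AND: successful coups that also involved combat
both <- coups %>% filter(successful == 1 & combat == 1)

# OR: coups that were successful or involved combat (or both)
either <- coups %>% filter(successful == 1 | combat == 1)

# NOT: coups that did not take place in Haiti
not_haiti <- coups %>% filter(!(country == "Haiti"))

nrow(both)       # 1
nrow(either)     # 4
nrow(not_haiti)  # 2
```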

Summary recoding

Target       Function
Binary       if_else()
Categorical  case_when()
Ordinal      factor()
Any          recode()
Other        as.numeric(), as.character(), as.Date(), etc.
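The functions in the table can be sketched together in a single mutate() call. The countries and scores below are made up, and the cut-offs (6 or more for democracy, -6 or less for autocracy) follow a commonly used Polity convention that is assumed here rather than taken from these slides.

```r
library(dplyr)

# Made-up scores on the -10 to 10 polity2 scale, including a missing value
regimes <- tibble(
  country = c("A", "B", "C", "D"),
  polity2 = c(-6, 0, 7, NA)
)

regimes <- regimes %>%
  mutate(
    # Binary target
    democracy = if_else(polity2 >= 6, "Democracy", "Non-democracy"),
    # Categorical target; rows matching no condition (here the NA) stay NA
    regime = case_when(
      polity2 >= 6    ~ "Democracy",
      polity2 <= -6   ~ "Autocracy",
      !is.na(polity2) ~ "Anocracy"
    ),
    # Ordinal target: explicit level order
    regime_ord = factor(regime,
                        levels = c("Autocracy", "Anocracy", "Democracy"),
                        ordered = TRUE),
    # Replace values one by one
    regime_short = recode(regime,
                          "Democracy" = "Dem",
                          "Autocracy" = "Aut",
                          "Anocracy"  = "Ano"),
    # Type conversion
    polity_chr = as.character(polity2)
  )

regimes
```

Missing values propagate: country D has no score, so every recoded column is NA for it. In recent dplyr versions recode() is marked as superseded, although it still works.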

Bibliography

Singer, J. David. 1987. “Reconstructing the Correlates of War Dataset on Material Capabilities of States, 1816-1985.” International Interactions 14: 115–32.
Singer, J. David, and Melvin Small. 1972. The Wages of War, 1816-1965: A Statistical Handbook. New York: Wiley.
Van Evera, Stephen. 2009. Guía para Estudiantes de Ciencia Política: Métodos y Recursos. Barcelona: Gedisa.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.
———. 2021. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.
———. 2022. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.